Dev v0.0.2 data preprocess #3
Conversation
… of preprocessing output
try:
    from IPython import get_ipython
    shell = get_ipython().__class__.__name__
    if shell == 'ZMQInteractiveShell':
        print("Running in Jupyter Notebook")
        IN_NOTEBOOK = True
    else:
        print("Running in IPython shell")
        IN_NOTEBOOK = False
except NameError:
    print("Running in standard Python shell")
    IN_NOTEBOOK = False
This looks like a handy utility function in the making. Consider elevating this to be reusable across other notebooks without duplication.
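A reusable version of this check might look like the following sketch. The module location and function name are illustrative, not part of the PR:

```python
# Hypothetical shared helper (e.g. utils/environment.py) so every notebook
# can import one function instead of duplicating the try/except block.

def in_notebook() -> bool:
    """Return True when running inside a Jupyter (ZMQ) kernel."""
    try:
        from IPython import get_ipython  # only importable when IPython is installed
        shell = get_ipython().__class__.__name__
        # get_ipython() returns None outside an IPython session, whose class
        # name is 'NoneType', so this comparison is safe there too.
        return shell == "ZMQInteractiveShell"
    except (ImportError, NameError):
        # Plain Python interpreter: IPython is absent entirely.
        return False
```

Notebooks would then call `in_notebook()` instead of re-running the detection logic.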
Fixed!
git_root = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"], text=True
).strip()
config_path = pathlib.Path(git_root) / "config.yml"
To avoid a subprocess call (which can become a challenge), consider using a pattern whereby the Jupyter notebook has access by default to the root of the project. This can be configured using a settings.json file if using VS Code to run notebooks. Alternatively, if using the Jupyter web interface, consider running Jupyter from the root of the project repo and navigating to the notebook.
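A pure-Python alternative, if a subprocess-free lookup is still wanted: walk upward from the working directory until a `.git` entry appears. This is a sketch; the function name and the choice of `.git` as the marker are assumptions, not code from the PR:

```python
from __future__ import annotations
from pathlib import Path

def find_repo_root(start: Path | None = None, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    here = (start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' found above {here}")
```

This removes the dependency on having `git` on PATH while keeping the notebook runnable from any subdirectory.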
Configured the default working directory of the notebooks to be the project root and dropped the subprocess call!
if not config_path.exists():
    raise FileNotFoundError(f"Config file not found at: {config_path}")
Consider making use of pathlib.Path.resolve(strict=True), possibly folded into the assignment above.
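Folding the check into the assignment might look like this sketch (the function wrapper and `git_root` parameter are illustrative):

```python
from pathlib import Path

def load_config_path(git_root: str) -> Path:
    # resolve(strict=True) raises FileNotFoundError when the path does not
    # exist, so the separate exists() check becomes unnecessary.
    return (Path(git_root) / "config.yml").resolve(strict=True)
```

The one trade-off is that the built-in FileNotFoundError message is less descriptive than the custom one in the current code.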
data_cfg = config.get("data")
if not data_cfg:
Consider the use of the walrus operator here to help check and assign at the same time.
Suggested change:
- data_cfg = config.get("data")
- if not data_cfg:
+ if not (data_cfg := config.get("data")):
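In context, the suggested pattern assigns and tests in a single expression. A minimal runnable illustration, using a stand-in dict for the parsed YAML config:

```python
config = {"data": {"path": "data/raw"}}  # stand-in for the parsed config.yml

# The walrus operator binds data_cfg and evaluates the condition in one step;
# an empty or missing "data" section falls into the error branch.
if not (data_cfg := config.get("data")):
    raise KeyError("Config is missing the 'data' section")
```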
for key in required_keys:
    value = data_cfg.get(key)
    if value is None:
        results.append((key, None, "Missing in config"))
        errors.append(f"Config key '{key}' is missing")
        continue
Seeing these checks made me wonder if it might make sense to use yaml schema validation. It's common to use jsonschema to do this because yaml is a superset of json. If you move in this direction, it takes a lot of the guesswork out of validating whether you have an object of a certain structure in yaml or json.
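A sketch of what jsonschema validation of the parsed config could look like. The schema keys below (`input_path`, `output_path`) are illustrative placeholders, not the real required keys from this repo's config:

```python
from jsonschema import validate, ValidationError

# Illustrative schema; adapt properties/required to the real config.yml layout.
DATA_SCHEMA = {
    "type": "object",
    "properties": {
        "data": {
            "type": "object",
            "properties": {
                "input_path": {"type": "string"},
                "output_path": {"type": "string"},
            },
            "required": ["input_path", "output_path"],
        }
    },
    "required": ["data"],
}

def check_config(config: dict) -> list:
    """Return a list of schema violations (empty when the config is valid)."""
    try:
        validate(instance=config, schema=DATA_SCHEMA)
        return []
    except ValidationError as err:
        return [err.message]
```

This replaces the hand-rolled per-key loop with a declarative description of the expected structure.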
Fixed using json schema!
label.set_rotation(90)

plt.tight_layout()
plt.show()
Would it be possible to save the plots? It might make comparisons as things proceed more clear. This might mean making the notebook check more flexible.
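One way to sketch this save-always, show-only-interactively behavior (helper name, output path, and dpi are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so saving works in script mode too
import matplotlib.pyplot as plt
from pathlib import Path

def finalize_plot(fig, out_path: Path, show: bool = False) -> Path:
    """Always save the figure; only display it when running interactively."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    if show:
        plt.show()
    plt.close(fig)  # free the figure either way
    return out_path
```

The `show` flag could be driven by the notebook-detection check earlier in the PR, so comparisons across runs land in a stable output directory.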
Made saving plots the default in both notebook and script mode; only showing the plot is skipped in script mode!
api:
  openai:
    key: "YOUR_OPENAI_API_KEY"
Depending on how this key is loaded, consider the use of python-dotenv to help keep the key out of a file checked into source. Otherwise, or maybe either way, be sure to gitignore this file.
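The environment-variable side of that suggestion can be sketched with the standard library alone; with python-dotenv, a `load_dotenv()` call before this would populate the environment from a gitignored `.env` file. The variable and function names here are illustrative:

```python
import os

def get_openai_key() -> str:
    """Read the API key from the environment instead of a checked-in file."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "Set OPENAI_API_KEY (e.g. via a gitignored .env file loaded "
            "with python-dotenv)"
        )
    return key
```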
Will decide whether to adopt this change in the future if more environment variables are needed.
script_path = repo_root / "analysis" / "0.data_wrangling" / "nbconverted" / "0.1.wrangle_depmap_prism_data.py"

# Run from repo root so `git rev-parse --show-toplevel` and config.yml resolve
result = subprocess.run(
    ["python", str(script_path)],
    cwd=repo_root,
    capture_output=True,
    text=True,
    env={**os.environ},  # inherit env
)
Instead of using subprocess to run this as a script, consider using Pythonic imports. If you install the work as part of the venv you can use something like from analysis import wrangle. This might change the way you name or store things, but it would make the work clearer, more Pythonic, and likely faster (running the already-interpreted code directly is faster than waiting for a new Python interpreter call to complete).
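Since `analysis.wrangle` does not exist yet in this repo, the mechanics of the import-and-call pattern can be shown with a throwaway module. Everything below (module source, names, return value) is a stand-in for the real package:

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

# Stand-in module; in the real repo this would be `from analysis import
# wrangle` after installing the package into the venv.
mod_path = Path(tempfile.mkdtemp()) / "wrangle.py"
mod_path.write_text("def main():\n    return 'wrangled'\n")

spec = importlib.util.spec_from_file_location("wrangle", str(mod_path))
wrangle = importlib.util.module_from_spec(spec)
sys.modules["wrangle"] = wrangle
spec.loader.exec_module(wrangle)

# Direct call: no new interpreter process, real Python objects returned
# instead of captured stdout/stderr text.
result = wrangle.main()
```

The practical payoff versus `subprocess.run` is real return values and tracebacks, plus no interpreter startup cost per call.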
/data/processed/processed_depmap_prism_ic50.csv

# Actual config.yml
config.yml
Consider adding this file's details to the readme.
…arate package importable by all notebooks
…pacakge, also attempts to reduce dependency on subproc to retrieve the git project root with pathing util
…plots in script mode.
Thanks for reviewing, Dave, merging!
Summary
This PR adds a preprocessing notebook for the DepMap PRISM secondary drug repurposing dataset, producing a clean, deduplicated table of drug–cell line IC50 values for downstream agentic experiments.
Key Changes
- Config file + template: config.yml.template is included in place of the ignored config.yml.
- Notebook for deduplication, merging screens (HTS002, MTS010), and generating summary visualization of the dataset (r^2).
- Trivial unit pytest to ensure the script version of the notebook runs.